-
Notifications
You must be signed in to change notification settings - Fork 21
VLM Finetuning support #411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
nikita-smetanin
commented
Dec 22, 2025
- Add support for Multimodal datasets in OpenAI-like format
- Add support for Vision-Language model training with optional Vision encoder finetuning
| elif messages_are_multimodal != is_multimodal: | ||
| # Due to the format limitation, we cannot mix multimodal and text only messages in the same sample. | ||
| raise InvalidFileFormatError( | ||
| "Messages in the conversation must be either all in multimodal or all intext only format.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo: ...or all in text-only
| message: The message to check. | ||
| idx: Line number in the file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
certainly dont mind this change but i wonder how it got in
|
|
||
| if model_limits.supports_vision: | ||
| # Don't show price estimation for multimodal models yet | ||
| confirm = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, i don't have context here, why does this prevent showing the price estimation?
|
|
||
| if model_limits.supports_vision: | ||
| multimodal_params = FinetuneMultimodalParams(train_vision=train_vision) | ||
| elif train_vision: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supernit: i prefer
elif not model_limits.supports_vision and train_vision
here. it's logically the same, but the condition is clearer
| line_number=idx + 1, | ||
| error_source="key_value", | ||
| ) | ||
| if not isinstance(message[column], str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps you can check isinstance(message[column], MessageContent) instead?
| def _check_message_role( | ||
| message: Dict[str, str | bool], previous_role: str | None, idx: int | ||
| ) -> str | bool: | ||
| message: Dict[str, str | int | MessageContent], previous_role: str | None, idx: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when is the message an int?
|
|
||
|
|
||
| def _check_message_content( | ||
| message_content: str | int | MessageContent, role: str, idx: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when is the message an int?